Abstract. In this paper we present a novel graph kernel framework
inspired by the Weisfeiler-Lehman (WL) isomorphism tests. Any
WL test comprises a relabelling phase of the nodes based on test-specific
information extracted from the graph, for example the set of neighbours
of a node. We define a novel relabelling and derive two kernels of
the framework from it. The novel kernels are very fast to compute and
achieve state-of-the-art results on five real-world datasets.


1 Introduction 

In many real-world learning problems, input data are naturally represented as
graphs [?]. A typical approach for solving machine learning tasks on structured
data is to project the input data onto a vectorial feature space and then
perform learning on that space. Ideally, a good projection should be injective,
i.e. it should map non-isomorphic data to different vectors in feature space.
When high-dimensional data, such as graphs, are involved, specific
challenges arise, especially from the computational point of view.

Kernel methods are considered to be among the most successful machine
learning techniques for structured data. They replace the explicit projection in
feature space with the evaluation of a symmetric positive semidefinite similarity
function, called the kernel function. A major advantage of kernel methods is
that very large, possibly infinite, feature spaces can be utilized by the learning
algorithm with a computational burden that depends on the complexity of the
kernel function and not on the size of the feature space. Unfortunately, any kernel
function for graphs whose corresponding feature space projection is injective is
at least as hard to compute as deciding whether two graphs are isomorphic [5],
a problem for which no polynomial-time algorithm is known.

As a consequence, in order to have computationally tractable kernel functions
for graph data, a certain amount of information loss is inevitable. Most
kernel functions for graphs associate specific types of substructures to features.
The evaluation of the kernel function is then related to the number of common
substructures between two input graphs. Such substructures include walks [?],
paths [?], specific types of subgraphs [?] and tree structures [?].
Such kernels, with the exception of the ones in [?] and [3], are computationally
too demanding to be used with large datasets, and they are effective only when the
corresponding features are relevant for the current task. Recently, the Fast Subtree
Kernel has been proposed [?]. It has linear complexity (in the number of edges)
and its features are subtree patterns of the input graphs. The kernel computes
a rough approximation of the one-dimensional Weisfeiler-Lehman isomorphism
test [?], with the explicit goal of being fast to compute.

In this paper we present two kernel functions for graphs inspired by extensions
of the Weisfeiler-Lehman isomorphism test. We define kernels whose feature
space is much larger than that of the Fast Subtree Kernel, with a modest increase
in computational complexity.


2 Weisfeiler-Lehman Isomorphism Test and Extensions 

Some notation is first introduced. A graph is a triplet $G = (V, E, L)$, where $V$
is the set of nodes and $|V|$ its cardinality, $E$ the set of edges and $L()$ a function
returning the label of a node. A graph is undirected if $(v_i, v_j) \in E \Leftrightarrow (v_j, v_i) \in E$,
otherwise it is directed. A path of length $n - 1$ in a graph is a sequence of distinct
nodes $v_1, \ldots, v_n$ such that $(v_i, v_{i+1}) \in E$ for $1 \le i < n$; if $v_1 = v_n$ the path
is a cycle. The distance $d(v_i, v_j)$ between the nodes $v_i, v_j$ is the length of any
shortest path connecting them.

We can now describe the Weisfeiler-Lehman isomorphism test and a few
extensions [?], which are all based on a relabelling process of the nodes of a
graph $G = (V, E, L)$. We introduce two functions which, once instantiated,
determine the isomorphism test: $\pi(G, v)$, where $v \in V$, and $h()$, with the constraint
that the codomain of $\pi(G, v)$ must coincide with the domain of $h()$. The role
of $\pi(G, v)$ is to extract specific information from $G$: for example, in the
one-dimensional WL (1-dim WL) test $\pi(G, v)$ extracts the set of neighbouring nodes
of $v$: $\pi(G, v) = \{u \mid u \in V, d(u, v) = 1\}$. The function $h()$ associates a unique
numerical value (colour in the mathematical jargon) to each $\pi(G, v)$, and $h(\pi(G, v))$
will be used as the new label for $v$. In order for $h()$ to be well defined, a canonical
representation for elements in its domain has to be defined, which practically
boils down to defining a partial ordering between the elements of $\pi(G, v)$. For example,
in the 1-dim WL test the elements of $\pi(G, v)$ are sorted alphabetically according
to their labels.

The algorithm for computing the isomorphism test proceeds by iteratively
relabelling the nodes of $G$ by means of a family of functions $L^i()$:

$$L^{i}(v) = h(\pi(G^{i-1}, v)), \qquad (1)$$

where $G^0 = (V, E, L)$ and $G^i = (V, E, L^i)$ for $i > 0$. The functions $L^i()$ are
constructed for all $i \le i^*$, where $i^*$ is the lowest index at which the
relabelling becomes stable, i.e. a further relabelling step leaves the label of
every $v \in V$ unchanged. Note that $i^* < |V|$ for the 1-dim WL test [13]. By applying
the relabelling in eq. (1) to graphs $G$ and $G'$, we obtain two multisets of node
labels: $\{L^{i^*}(v) \mid v \in V\}$ and $\{L'^{\,i^*}(v') \mid v' \in V'\}$. If such multisets are different,
then the two graphs are not isomorphic. Conversely, if the two multisets
are identical, there is not enough information to tell whether the two graphs are
isomorphic.
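
The relabelling loop of the 1-dim WL test can be illustrated with a minimal Python sketch. It is not the authors' implementation: the adjacency-dictionary input format and the names `refine` and `wl_1dim_test` are assumptions made for illustration, and the signature of a node includes its own current label in addition to the sorted labels of its neighbours, as in the usual formulation of the test.

```python
from collections import Counter


def refine(adj, labels, table):
    """One relabelling step: new label = h(own label, sorted neighbour labels)."""
    new_labels = {}
    for v, neighbours in adj.items():
        signature = (labels[v], tuple(sorted(labels[u] for u in neighbours)))
        # `table` plays the role of h(): one integer colour per signature.
        new_labels[v] = table.setdefault(signature, len(table))
    return new_labels


def wl_1dim_test(adj_a, labels_a, adj_b, labels_b):
    """Return False if the graphs are certainly non-isomorphic, True if inconclusive."""
    if Counter(labels_a.values()) != Counter(labels_b.values()):
        return False
    for _ in range(len(adj_a)):
        table = {}  # shared by both graphs so that colours are comparable
        new_a = refine(adj_a, labels_a, table)
        new_b = refine(adj_b, labels_b, table)
        if Counter(new_a.values()) != Counter(new_b.values()):
            return False
        # Each step can only split colour classes, so an unchanged number of
        # classes means the relabelling is stable.
        stable = len(set(new_a.values())) == len(set(labels_a.values()))
        labels_a, labels_b = new_a, new_b
        if stable:
            break
    return True
```

With uniformly labelled nodes, a triangle and a three-node path are told apart after one step, whereas a six-node cycle and two disjoint triangles yield identical multisets, so the test stays inconclusive even though the graphs are not isomorphic.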


Extensions to the 1-dim WL test have been proposed to increase the discriminative
power of the test. Their idea is to enrich the type of information used in the
relabelling phase. The extension proposed by Miyazaki [12] considers the
colour of the nodes up to distance $K$: $\pi(G, v) = \{(l, u) \mid u \in V, d(v, u) = l \le K\}$;
the elements of $\pi(G, v)$, i.e. the tuples $(l, u)$, are ordered according to the relation
$(l, u) < (l', u') \Leftrightarrow l < l' \lor (l = l' \land L_\pi(u) < L_\pi(u'))$, where $L_\pi()$ is a generic
labelling function. In the extension of Oliveira et al. [?], $h()$ is defined on paths,
which are ordered according to the sequence of labels of the nodes in the path.
Specifically, $\pi(G, v)$ extracts, for each $u \in V$, the shortest path between $v$ and $u$
having the lowest $h()$ value: letting $s(v, u)$ be the set of shortest paths connecting $u$ and
$v$, $\pi(G, v) = \bigcup_{u \in V} \arg\min_{p \in s(v,u)} h(p)$.
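
As an illustration of a distance-based relabelling of the kind attributed to Miyazaki above, the following Python sketch collects the pairs (distance, label) for all nodes within distance K of v and sorts them first by distance and then by label. The function names and the adjacency-dictionary format are hypothetical, and the sketch only builds the canonical signature; mapping it to a colour via h() is left out.

```python
from collections import deque


def bfs_distances(adj, source, max_dist):
    """Breadth-first distances from `source`, truncated at `max_dist`."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        x = queue.popleft()
        if dist[x] == max_dist:
            continue  # do not expand beyond the distance bound
        for y in adj[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    return dist


def distance_signature(adj, labels, v, K):
    """Canonical form of {(l, u) | d(v, u) = l <= K}: sort by l, then by label."""
    dist = bfs_distances(adj, v, K)
    return tuple(sorted((l, labels[u]) for u, l in dist.items()))
```

Because the tuples are compared lexicographically, sorting them reproduces the ordering "smaller distance first, ties broken by label" described in the text.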

3 Weisfeiler-Lehman kernel framework 

Let us consider a function $\pi_r(G, v)$ depending on a parameter $r$, with $1 \le r \le K$.
Given a graph $G = (V, E, L)$, the application of eq. (1), for a fixed $r$ value at
the $i$-th iteration, yields the graph $G_r^i = (V, E, L_r^i)$, which differs from the
original graph only in the labelling function.

Definition 1. Let $k()$ be any kernel for graphs that we will refer to as the base
kernel. Then the Extended Weisfeiler-Lehman kernel with $h$ iterations, depth $K$
and base kernel $k()$ is defined as:

$$WL_K^h(G, G') = \sum_{r=1}^{K} \sum_{i=0}^{h} k(G_r^i, {G'}_r^{i}). \qquad (2)$$

Since the functions $L_r^i()$ in eq. (1) are well defined and the Extended Weisfeiler-Lehman
kernel of eq. (2) is a finite sum of positive semidefinite functions, it is
also positive semidefinite.
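
The double sum of eq. (2) can be sketched in Python as follows. This is only an illustration under stated assumptions: the relabelled graphs are taken as already computed and stored as lists of node labels indexed by the pair (r, i), and `delta_base_kernel` is a hypothetical base kernel that counts pairs of nodes with equal labels.

```python
from collections import Counter


def delta_base_kernel(labels_a, labels_b):
    """Count pairs (v, v') carrying the same label (a hard match)."""
    count_b = Counter(labels_b)
    return sum(count_b[label] for label in labels_a)


def extended_wl_kernel(labelled_a, labelled_b, K, h, base_kernel):
    """Sum the base kernel over r = 1..K and i = 0..h, as in eq. (2)."""
    return sum(
        base_kernel(labelled_a[(r, i)], labelled_b[(r, i)])
        for r in range(1, K + 1)
        for i in range(h + 1)
    )
```

Note that the index i starts at 0, so the original labelling contributes to the kernel value together with the h relabelled versions of each graph.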

Let us now present the main contribution of the paper, i.e. two novel kernels
which are instances of eq. (2). For both kernels the function $\pi_r(G, v)$ returns
the following Directed Acyclic Graph (DAG) rooted at $v$: $D_r(v) = (V_r, E_r, L)$,
where $V_r = \{u \in V \mid d(v, u) \le r\}$ and $E_r$ consists of all edges of $G$ that appear
in any of the shortest paths connecting $v$ and any $u \in V_r$ (see Fig. 1b for an
example). In order to have a canonical representation for the DAG $D_r(v)$, the
ordering for DAG nodes described in [?] is used. The function $h()$ assigns a unique
numerical value to each DAG, and it can be implemented efficiently as presented
in [3]. Let the maximum number of nodes of each DAG $D_r(v)$ be $|D_r|$. Then it
can be shown that $|D_r|$ is $O(\rho^r)$ [?], where $\rho$ is the maximum node outdegree.
Computing all the indices $L_r^i()$ for a graph $G$ has worst-case time complexity
$O(|D_r||E| \log |D_r||E|)$ (see [?] for details). Assuming $\rho$ constant (a condition that
usually holds in real-world datasets) the worst-case time complexity reduces to
$O(|E| \log |E|)$.
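
A minimal Python sketch of how the DAG rooted at v could be built is given below. It relies on the observation that an edge (x, y) lies on a shortest path from v exactly when d(v, y) = d(v, x) + 1. The function name `shortest_path_dag` and the adjacency-dictionary format are assumptions for illustration; the canonical ordering of the DAG nodes and the function h() are not covered.

```python
from collections import deque


def shortest_path_dag(adj, v, r):
    """Return the DAG rooted at v as {node: [children]}, limited to distance r."""
    dist = {v: 0}
    queue = deque([v])
    while queue:
        x = queue.popleft()
        if dist[x] == r:
            continue  # nodes at distance r are the leaves of the DAG
        for y in adj[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    dag = {x: [] for x in dist}
    for x in dist:
        for y in adj[x]:
            # Keep only edges that move one step further away from the root.
            if y in dist and dist[y] == dist[x] + 1:
                dag[x].append(y)
    return dag
```

On a four-node cycle rooted at node 0 with r = 2, both neighbours of the root point to the opposite node, which shows why the result is a DAG rather than a tree.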

In the first proposed kernel, which we will refer to as WL_NS-DDK, the base
kernel is defined as

$$k(G_r^i, {G'}_r^{i}) = \sum_{v \in V} \sum_{v' \in V'} \delta(L_r^i(v), L_r^i(v')), \qquad (3)$$




Fig. 1. Steps for obtaining some of the features of the WL_DDK kernel: a) an input
graph G; b) the DAG resulting from the application of $\pi(G, v)$ where $v$ is the node
labelled as s; c) the tree visit $T(v)$; d) the features of the ST kernel related to $T(v)$.


where $\delta$ is the Kronecker delta function. Note that computing the kernel is
equivalent to performing a hard match between the DAGs encoded by $L_r^i(v)$
and $L_r^i(v')$. If we sort the lists of indices $\{L_r^i(v) \mid v \in V\}$ and $\{L_r^i(v') \mid v' \in V'\}$,
then eq. (3) can be computed in $O(|V| \log |V|)$ time.
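
The sorting argument can be made concrete with a short Python sketch: after sorting both lists of indices, a single merge pass finds the runs of equal values, and each run contributes the product of its lengths in the two lists. The function name `delta_kernel_sorted` is hypothetical and the indices are assumed to be mutually comparable (for example integers).

```python
def delta_kernel_sorted(indices_a, indices_b):
    """Number of pairs (v, v') with equal indices, via sort and merge."""
    a, b = sorted(indices_a), sorted(indices_b)
    i = j = total = 0
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            i += 1
        elif a[i] > b[j]:
            j += 1
        else:
            value = a[i]
            run_a = run_b = 0
            while i < len(a) and a[i] == value:
                i += 1
                run_a += 1
            while j < len(b) and b[j] == value:
                j += 1
                run_b += 1
            # Every node of the first run matches every node of the second.
            total += run_a * run_b
    return total
```

The cost is dominated by the two sorts, since the merge pass visits each index once.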

The second kernel we propose, referred to as WL_DDK, differs from the first
one only in the base kernel $k()$. Let $T(v)$ be the function that first computes
the DAG $\pi_r(G, v)$ and then returns the tree resulting from the breadth-first visit
of the DAG starting from $v$ (see Fig. 1c for an example). Finally, $k()$ can be
defined as any kernel for trees applied to $T(v)$ and $T(v')$, for example the subtree
kernel (ST):

$$k(G_r^i, {G'}_r^{i}) = \sum_{v \in V} \sum_{v' \in V'} k_{ST}(T(v), T(v')). \qquad (4)$$

The ST kernel counts the number of matching proper subtrees of $T(v)$ and $T(v')$,
where a proper subtree of a tree $T$ rooted at $u$ is the subtree composed of $u$ and
all of its descendants (Fig. 1d lists the proper subtrees of the tree
in Fig. 1c). The complexity of $k_{ST}(T, T')$ is $O(n \log n)$ where $n = \min(|T|, |T'|)$.
Assuming $\rho$ constant, $O(|T(v)|) = O(|D_r(v)|)$. By using the algorithm described
in [5], the complexity of computing eq. (4) is $O(|V| \log |V|)$.
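
The counting of matching proper subtrees can be sketched in Python as follows. This is an unweighted illustration, not the algorithm referenced in the text: trees are given as nested `(label, [children])` pairs, children are treated as unordered by sorting their encodings, labels are assumed not to contain parentheses, and the names `subtree_codes` and `st_kernel` are hypothetical. Building string encodings is simple but does not achieve the stated $O(n \log n)$ bound.

```python
from collections import Counter


def subtree_codes(tree, out):
    """Encode the proper subtree rooted at each node and count the encodings."""
    label, children = tree
    child_codes = sorted(subtree_codes(child, out) for child in children)
    code = "(" + str(label) + "".join(child_codes) + ")"
    out[code] += 1
    return code


def st_kernel(tree_a, tree_b):
    """Number of pairs of nodes whose proper subtrees are identical."""
    codes_a, codes_b = Counter(), Counter()
    subtree_codes(tree_a, codes_a)
    subtree_codes(tree_b, codes_b)
    return sum(n * codes_b[code] for code, n in codes_a.items())
```

Two trees that differ only in the order of their children obtain the same encodings, so every proper subtree of one matches exactly one proper subtree of the other.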

There are a number of kernels in the literature that are instances of eq. (2).
The Fast Subtree Kernel (FS) counts the number of identical subtree patterns
of depth $h$ [?]. It can be obtained from eq. (2) by setting: i) $K = 1$;
ii) $\pi(G, v) = \{u \mid u \in V, d(v, u) = 1\}$ and then ordering the elements of $\pi(G, v)$
alphabetically according to their labels; iii) the base kernel $k()$ is the one in eq. (3).
The ODD-ST_h kernel, described in [?], is an instance of the WL_DDK of eq. (4) and it
is obtained by setting $h = 0$ in eq. (2).

4 Experimental results 

In this section, we compare the two kernels presented in Section 3 against other
state-of-the-art kernels on five real-world datasets.

We considered the Fast Subtree kernel [?], the ODD-ST_h kernel (described
in Section 3) and the NSPDK kernel [5], which computes the exact matches
between pairs of subgraphs with controlled size and distance. For the assessment
of the performance of the proposed kernels, we considered five real-world


Kernel      CAS               CPDB              AIDS              NCI1              GDD               AVG Rank
FS          81.05 ±0.50 (5)   73.22 ±0.78 (5)   75.61 ±1.00 (5)   84.77 ±0.31 (3)   76.21 ±1.15 (2)   4
NSPDK       83.60 ±0.34 (2)   76.99 ±1.15 (2)   82.71 ±0.66 (3)   83.46 ±0.46 (4)   74.09 ±0.91 (5)   3.2
ODD-ST_h    83.34 ±0.31 (3)   76.44 ±0.62 (4)   81.51 ±0.74 (4)   82.10 ±0.42 (5)   75.23 ±0.70 (4)   4
WL_NS-DDK   82.96 ±0.49 (4)   77.03 ±1.18 (1)   82.80 ±0.66 (2)   84.79 ±0.36 (2)   77.20 ±0.65 (1)   2
WL_DDK      83.91 ±0.29 (1)   76.52 ±1.16 (3)   82.93 ±0.71 (1)   84.90 ±0.33 (1)   75.45 ±0.86 (3)   1.8

Table 1. Average accuracy results ± standard deviation in nested 10-fold cross
validation for the Fast Subtree (FS), the Neighborhood Subgraph Pairwise Distance
(NSPDK), the ODD-ST_h, WL_NS-DDK and WL_DDK kernels obtained on the CAS, CPDB,
AIDS, NCI1 and GDD datasets. The rank of the kernel on each dataset is reported in brackets.


datasets: CAS¹, CPDB [?], AIDS [?], NCI1 [?] and GDD [7]. All the datasets
represent binary classification problems. The first four datasets involve chemical
compounds, represented as graphs where the nodes represent the atoms (labelled
according to the atom type) and the edges the bonds between them. In chemical
compounds, there are no self-loops. GDD is a dataset of proteins, where each
protein is represented by a graph, in which the nodes are amino acids and two
nodes are connected by an edge if they are less than 6 Ångströms apart. CAS
and NCI1 are the largest datasets, with 4337 and 4110 examples, respectively.
For more information about the datasets, please refer to [?].

All the kernels have been employed together with a Support Vector Machine.
The C parameter of the SVM has been selected in the set {0.01, 0.1, 1, 10, 100}.
For all the experiments, the values of the parameters of the ODD-ST_h kernel
have been restricted to a depth in {1, 2, ..., 8} and $\lambda \in \{0.1, 0.2, \ldots, 2.0\}$ ($\lambda$ is a
parameter of $k_{ST}$); for the Fast Subtree Kernel we optimized the only parameter of the
kernel, $h \in \{1, 2, \ldots, 10\}$; for the NSPDK kernel we optimized the parameters
$r \in \{1, 2, \ldots, 8\}$ and $d \in \{1, 2, \ldots, 8\}$. Concerning the two kernels presented
in this article, their parameters are $K \in \{1, 2, 3, 4\}$, $h \in \{0, 1, 2, \ldots, 8\}$ and
$\lambda \in \{0.1, 0.2, \ldots, 2.0\}$. The parameter ranges have been selected in such a way
that the computational time needed for the calculation of the kernel matrices
is roughly comparable, i.e. at most one hour on a modern PC. For parameter
selection we adopt a technique commonly referred to as nested K-fold cross
validation, following [?]. All the experiments have been repeated 10 times and
the average results (with standard deviation) are reported.

Table 1 summarizes the average accuracy results of the proposed kernels
and the state-of-the-art ones on the considered datasets. The mean accuracy 
is reported with the standard deviation. Between brackets, the ranking of the 


¹ http://www.cheminformatics.org/datasets/bursi
















[Plot: "Gram matrix computation for NCI1 dataset"; x-axis: kernel parameter]

Fig. 2. Comparison between the time needed for computing the Gram matrix on the
NCI1 dataset for the different kernels, as a function of the parameter: h for FS and
WL_NS-DDK, K for ODD-ST_h, r for NSPDK.


specific kernel on the dataset is reported. In the rightmost column, the average
ranking value on all the datasets for each kernel is reported. When considering
single datasets, there is no dataset where the NSPDK or FS kernels rank first. On
all the considered datasets, either WL_NS-DDK or WL_DDK outperforms the
other kernels. If we look at the average ranking, the situation is clearer. The
best average ranking of the competing kernels is the one of NSPDK, with a
value of 3.2. WL_NS-DDK has an average ranking of 2. WL_DDK performs
slightly better, with an average ranking value of 1.8. These results show
that, on the considered datasets, the WL kernel family performs better than the
other kernels present in the literature.
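
The per-dataset ranks and the average ranks can be recomputed from the mean accuracies of Table 1 with a few lines of Python. The sketch below is only a consistency check; the dictionary layout and the function name `average_ranks` are assumptions, and ties (absent from these data) are not handled.

```python
# Mean accuracies from Table 1, in the order CAS, CPDB, AIDS, NCI1, GDD.
accuracy = {
    "FS": [81.05, 73.22, 75.61, 84.77, 76.21],
    "NSPDK": [83.60, 76.99, 82.71, 83.46, 74.09],
    "ODD-ST_h": [83.34, 76.44, 81.51, 82.10, 75.23],
    "WL_NS-DDK": [82.96, 77.03, 82.80, 84.79, 77.20],
    "WL_DDK": [83.91, 76.52, 82.93, 84.90, 75.45],
}


def average_ranks(accuracy):
    """Rank kernels on each dataset (1 = best) and average the ranks."""
    kernels = list(accuracy)
    n_datasets = len(next(iter(accuracy.values())))
    totals = dict.fromkeys(kernels, 0)
    for d in range(n_datasets):
        ordered = sorted(kernels, key=lambda k: accuracy[k][d], reverse=True)
        for rank, kernel in enumerate(ordered, start=1):
            totals[kernel] += rank
    return {kernel: totals[kernel] / n_datasets for kernel in kernels}
```

Calling `average_ranks(accuracy)` gives 4.0 for FS and ODD-ST_h, 3.2 for NSPDK, 2.0 for WL_NS-DDK and 1.8 for WL_DDK, in agreement with the rightmost column of the table.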

Figure 2 reports the computational time, in seconds, needed by the
WL_NS-DDK kernel and the competing ones to compute the Gram matrix for
the NCI1 dataset. The computation time required by WL_DDK is very similar
and thus omitted.


5 Conclusions and future work 

This paper proposed a new framework for the definition of graph kernels based
on a generalization of the 1-dimensional WL test. The framework can be instantiated
with any kernel for graphs as a base kernel. In particular, we analyzed
two instances inspired by the Decompositional DAGs graph kernels. The two
kernels show state-of-the-art predictive performance on five real-world datasets,
with a computational burden that, on such datasets, grows only linearly with
respect to the kernel parameters. As future work, we will explore other members
of the framework.